This report analyzes the BostonHousing dataset, which describes
housing characteristics for 506 census tracts in the Boston area. We
will explore the relationships between several variables and the median
home value (medv).
We analyze the distribution of home prices (medv).
We compute the correlation matrix and display values associated with
medv.
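A minimal sketch of how these correlations can be computed, assuming the data comes from the mlbench package (where chas is stored as a factor and must be coerced before calling cor):

```r
library(mlbench)
data(BostonHousing)

# chas is a factor in mlbench's version of the data; coerce every column
# to numeric so cor() accepts the full data frame.
num_data <- data.frame(lapply(BostonHousing,
                              function(x) as.numeric(as.character(x))))

# Full correlation matrix, then the row of correlations with medv.
cor_matrix <- cor(num_data)
cor_matrix["medv", ]
```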
## crim zn indus chas nox rm age
## -0.3883046 0.3604453 -0.4837252 0.1752602 -0.4273208 0.6953599 -0.3769546
## dis rad tax ptratio b lstat medv
## 0.2499287 -0.3816262 -0.4685359 -0.5077867 0.3334608 -0.7376627 1.0000000
Observations:
- Positive correlation with rm (number of
rooms).
- Negative correlation with lstat
(percentage of lower-status population).
- tax and crim have a negative
relationship with medv.
To get a first grasp of the data, and the different relationships
between variables, we can use the pairs function.
Even though there are many variables, the pairs plot reveals a positive
relationship between rm and medv, a negative
relationship between lstat and medv, and
another negative relationship between crim and medv. We
will therefore fit a simple linear regression for each of these variables.
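The scatterplot matrix can be produced with a single call (a sketch, assuming BostonHousing is loaded as above):

```r
# One scatterplot per pair of variables; scan the medv row and column
# for relationships with the response.
pairs(BostonHousing)
```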
## `geom_smooth()` using formula = 'y ~ x'
Here we can see a clear positive correlation between rm and
medv, despite a few outliers. Also, based on
the description of the dataset on Kaggle, some data scientists suspect
that the medv variable was censored, since medv
often takes the value 50, which is also its maximum. This property makes
us question the reliability of this dataset, but we proceed regardless.
## `geom_smooth()` using formula = 'y ~ x'
Here we can see that the relationship between lstat and
medv is negative. The hypothesis is that a higher
percentage of lower-status population is associated with lower home values.
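A sketch of the kind of plot described here, using ggplot2; geom_smooth(method = "lm") is what prints the `using formula = 'y ~ x'` message visible in the output:

```r
library(ggplot2)

# Scatterplot of lstat against medv with a fitted least-squares line.
ggplot(BostonHousing, aes(x = lstat, y = medv)) +
  geom_point() +
  geom_smooth(method = "lm")
```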
Next, we examine the distributions of medv, crim and
tax.
Observations:
- crim is highly skewed: most
neighborhoods have low crime rates, but some areas have extreme
values.
- tax shows distinct peaks, suggesting
fixed tax rates in certain areas.
## `geom_smooth()` using formula = 'y ~ x'
Observation: Home prices decrease as crime rates increase.
## `geom_smooth()` using formula = 'y ~ x'
Observation: A higher tax rate is associated with lower home prices.
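The two-predictor model summarized below can be fitted like this (model_ct is a name of our choosing):

```r
# Regress medv on crime rate and tax rate jointly.
model_ct <- lm(medv ~ crim + tax, data = BostonHousing)
summary(model_ct)
```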
##
## Call:
## lm(formula = medv ~ crim + tax, data = BostonHousing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.005 -4.929 -2.099 2.945 33.676
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.379119 1.033523 30.361 < 2e-16 ***
## crim -0.186617 0.051156 -3.648 0.000292 ***
## tax -0.020018 0.002611 -7.667 9.15e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.036 on 503 degrees of freedom
## Multiple R-squared: 0.2396, Adjusted R-squared: 0.2366
## F-statistic: 79.27 on 2 and 503 DF, p-value: < 2.2e-16
Results:
- The coefficients for crim and tax
are negative, confirming their negative association with
medv.
- p-values < 0.05 indicate these effects are
statistically significant.
To start, we can use the backward selection algorithm to find
the best linear regression model for predicting medv. But
first, let’s fit a model with all the variables to see what
happens.
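A sketch of the full model fit whose summary follows (the `.` on the right-hand side means “all other columns”):

```r
# medv against every other variable in the data frame.
full_model <- lm(medv ~ ., data = BostonHousing)
summary(full_model)
```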
##
## Call:
## lm(formula = medv ~ ., data = BostonHousing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## b 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
Clearly, some variables are not significant
(e.g. indus or age). So let’s use the
backward selection algorithm to find the best model.
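One way to run backward selection is base R’s step(), which starts from the full model and repeatedly drops the predictor whose removal most improves AIC (a sketch; the original analysis may have used a different function):

```r
# trace = 0 suppresses the step-by-step log; only the final model is kept.
backward_model <- step(lm(medv ~ ., data = BostonHousing),
                       direction = "backward", trace = 0)
summary(backward_model)
```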
##
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
## tax + ptratio + b + lstat, data = BostonHousing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
## crim -0.108413 0.032779 -3.307 0.001010 **
## zn 0.045845 0.013523 3.390 0.000754 ***
## chas 2.718716 0.854240 3.183 0.001551 **
## nox -17.376023 3.535243 -4.915 1.21e-06 ***
## rm 3.801579 0.406316 9.356 < 2e-16 ***
## dis -1.492711 0.185731 -8.037 6.84e-15 ***
## rad 0.299608 0.063402 4.726 3.00e-06 ***
## tax -0.011778 0.003372 -3.493 0.000521 ***
## ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
## b 0.009291 0.002674 3.475 0.000557 ***
## lstat -0.522553 0.047424 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
We can see that the model achieves a p-value below
2.2e-16 with a residual standard error of 4.736. This error is hard to
interpret directly because the variables are on very different scales;
later we will normalize our variables so it can be interpreted more
easily. However, after plotting the model, we can see on
the Q-Q plot that the residuals are not normally distributed, which suggests
that a linear regression model is not appropriate for our data. Still,
we will try the forward selection algorithm to showcase our skills
🕶️.
The goal of these boxplots is to show how the data is distributed. The boxplot below shows that, after normalization, every variable has a mean of 0 and a comparable range.
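The normalization can be done with scale(), which centers each column to mean 0 and rescales it to unit standard deviation (a sketch; chas must first be coerced to numeric since mlbench stores it as a factor):

```r
# Coerce every column to numeric, then standardize.
num_data <- data.frame(lapply(BostonHousing,
                              function(x) as.numeric(as.character(x))))
normalized_data <- as.data.frame(scale(num_data))

# Boxplots of the standardized variables, all on a common scale.
boxplot(normalized_data, las = 2)
```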
##
## Call:
## lm(formula = medv ~ ., data = normalized_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.69559 -0.29680 -0.05633 0.19322 2.84864
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.726e-16 2.294e-02 0.000 1.000000
## crim -1.010e-01 3.074e-02 -3.287 0.001087 **
## zn 1.177e-01 3.481e-02 3.382 0.000778 ***
## indus 1.534e-02 4.587e-02 0.334 0.738288
## chas 7.420e-02 2.379e-02 3.118 0.001925 **
## nox -2.238e-01 4.813e-02 -4.651 4.25e-06 ***
## rm 2.911e-01 3.193e-02 9.116 < 2e-16 ***
## age 2.119e-03 4.043e-02 0.052 0.958229
## dis -3.378e-01 4.567e-02 -7.398 6.01e-13 ***
## rad 2.897e-01 6.281e-02 4.613 5.07e-06 ***
## tax -2.260e-01 6.891e-02 -3.280 0.001112 **
## ptratio -2.243e-01 3.080e-02 -7.283 1.31e-12 ***
## b 9.243e-02 2.666e-02 3.467 0.000573 ***
## lstat -4.074e-01 3.938e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.516 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
## tax + ptratio + b + lstat, data = normalized_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.69601 -0.29777 -0.05487 0.18780 2.85278
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.675e-16 2.289e-02 0.000 1.000000
## crim -1.014e-01 3.066e-02 -3.307 0.001010 **
## zn 1.163e-01 3.429e-02 3.390 0.000754 ***
## chas 7.508e-02 2.359e-02 3.183 0.001551 **
## nox -2.189e-01 4.454e-02 -4.915 1.21e-06 ***
## rm 2.904e-01 3.104e-02 9.356 < 2e-16 ***
## dis -3.418e-01 4.252e-02 -8.037 6.84e-15 ***
## rad 2.837e-01 6.003e-02 4.726 3.00e-06 ***
## tax -2.158e-01 6.180e-02 -3.493 0.000521 ***
## ptratio -2.228e-01 3.038e-02 -7.334 9.24e-13 ***
## b 9.223e-02 2.654e-02 3.475 0.000557 ***
## lstat -4.057e-01 3.682e-02 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.515 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
On our normalized data, the model barely changed. The only output that differs is the residual standard error, which is now 0.515. Relative to the unit variance of the normalized medv, this value is still quite high, which confirms that this model is not the best fit for our use case.
This time we will use the stepAIC function to apply the
forward selection algorithm. We will use the intercept-only model as the
lower bound and the model with all the variables as the upper bound.
Previously we used a different library.
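A sketch of the forward-selection call, assuming stepAIC from the MASS package with the intercept-only model as the starting point:

```r
library(MASS)

null_model <- lm(medv ~ 1, data = BostonHousing)  # lower bound
full_model <- lm(medv ~ ., data = BostonHousing)  # upper bound

# Add one variable at a time, keeping the addition that lowers AIC most,
# until no further addition improves AIC; this prints the trace below.
forward_model <- stepAIC(null_model, direction = "forward",
                         scope = list(lower = null_model,
                                      upper = full_model))
```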
## Start: AIC=1794.1
## medv ~ 1
##
## Df Sum of Sq RSS AIC
## + lstat 1 18882.8 15226 1470.2
## + rm 1 16276.0 17832 1534.1
## + ptratio 1 9523.9 24584 1663.8
## + tax 1 7627.9 26480 1693.8
## + indus 1 7324.6 26784 1698.4
## + nox 1 5913.2 28195 1719.2
## + rad 1 5021.7 29087 1731.8
## + crim 1 5018.5 29090 1731.8
## + age 1 4779.6 29329 1735.1
## + zn 1 4293.4 29815 1741.7
## + b 1 3449.2 30659 1753.0
## + dis 1 2093.9 32014 1770.5
## + chas 1 1056.1 33052 1783.4
## <none> 34108 1794.1
##
## Step: AIC=1470.24
## medv ~ lstat
##
## Df Sum of Sq RSS AIC
## + rm 1 2941.58 12284 1385.5
## + ptratio 1 2209.92 13016 1408.9
## + chas 1 796.68 14429 1450.5
## + dis 1 737.33 14488 1452.2
## + age 1 295.97 14930 1464.3
## + tax 1 158.27 15067 1468.0
## + crim 1 81.23 15144 1470.1
## + zn 1 79.62 15146 1470.1
## <none> 15226 1470.2
## + b 1 63.28 15162 1470.6
## + indus 1 37.42 15188 1471.2
## + nox 1 18.58 15207 1471.8
## + rad 1 3.99 15222 1472.1
##
## Step: AIC=1385.51
## medv ~ lstat + rm
##
## Df Sum of Sq RSS AIC
## + ptratio 1 1337.28 10947 1341.0
## + chas 1 641.96 11642 1365.8
## + dis 1 419.55 11864 1373.5
## + b 1 201.23 12083 1380.8
## + tax 1 178.43 12106 1381.6
## + crim 1 170.05 12114 1381.9
## + age 1 75.82 12208 1385.0
## <none> 12284 1385.5
## + rad 1 43.67 12240 1386.1
## + zn 1 19.76 12264 1386.9
## + indus 1 8.34 12276 1387.2
## + nox 1 0.25 12284 1387.5
##
## Step: AIC=1340.95
## medv ~ lstat + rm + ptratio
##
## Df Sum of Sq RSS AIC
## + dis 1 546.03 10401 1322.3
## + chas 1 414.26 10532 1327.4
## + b 1 169.40 10777 1336.7
## + age 1 131.58 10815 1338.1
## + crim 1 59.78 10887 1340.7
## <none> 10947 1341.0
## + rad 1 47.26 10899 1341.2
## + zn 1 22.12 10924 1342.1
## + indus 1 10.28 10936 1342.6
## + tax 1 2.79 10944 1342.8
## + nox 1 0.45 10946 1342.9
##
## Step: AIC=1322.27
## medv ~ lstat + rm + ptratio + dis
##
## Df Sum of Sq RSS AIC
## + nox 1 499.37 9901.2 1304.4
## + chas 1 295.49 10105.1 1312.6
## + b 1 237.57 10163.0 1314.9
## + indus 1 196.54 10204.1 1316.6
## + zn 1 156.10 10244.5 1318.2
## + crim 1 142.85 10257.8 1318.7
## + tax 1 119.46 10281.2 1319.6
## <none> 10400.6 1322.3
## + age 1 25.87 10374.7 1323.3
## + rad 1 0.42 10400.2 1324.3
##
## Step: AIC=1304.4
## medv ~ lstat + rm + ptratio + dis + nox
##
## Df Sum of Sq RSS AIC
## + chas 1 350.10 9551.1 1291.8
## + zn 1 153.36 9747.9 1300.1
## + b 1 150.70 9750.6 1300.2
## + rad 1 89.12 9812.1 1302.7
## + crim 1 86.99 9814.3 1302.8
## <none> 9901.2 1304.4
## + indus 1 19.25 9882.0 1305.6
## + age 1 1.44 9899.8 1306.3
## + tax 1 0.80 9900.4 1306.4
##
## Step: AIC=1291.85
## medv ~ lstat + rm + ptratio + dis + nox + chas
##
## Df Sum of Sq RSS AIC
## + zn 1 160.673 9390.5 1287.0
## + b 1 125.439 9425.7 1288.5
## + rad 1 108.724 9442.4 1289.2
## + crim 1 64.013 9487.1 1291.1
## <none> 9551.1 1291.8
## + indus 1 28.105 9523.0 1292.7
## + tax 1 1.193 9550.0 1293.8
## + age 1 0.436 9550.7 1293.8
##
## Step: AIC=1287
## medv ~ lstat + rm + ptratio + dis + nox + chas + zn
##
## Df Sum of Sq RSS AIC
## + b 1 142.489 9248.0 1282.8
## + crim 1 104.162 9286.3 1284.5
## + rad 1 66.011 9324.5 1286.2
## <none> 9390.5 1287.0
## + indus 1 28.314 9362.2 1287.8
## + age 1 6.578 9383.9 1288.7
## + tax 1 5.502 9385.0 1288.8
##
## Step: AIC=1282.82
## medv ~ lstat + rm + ptratio + dis + nox + chas + zn + b
##
## Df Sum of Sq RSS AIC
## + rad 1 115.883 9132.1 1279.7
## + crim 1 64.248 9183.7 1282.0
## <none> 9248.0 1282.8
## + indus 1 19.159 9228.8 1284.0
## + age 1 2.476 9245.5 1284.7
## + tax 1 0.018 9248.0 1284.8
##
## Step: AIC=1279.73
## medv ~ lstat + rm + ptratio + dis + nox + chas + zn + b + rad
##
## Df Sum of Sq RSS AIC
## + crim 1 188.587 8943.5 1273.3
## + tax 1 186.216 8945.9 1273.4
## <none> 9132.1 1279.7
## + indus 1 33.416 9098.7 1280.2
## + age 1 7.356 9124.7 1281.4
##
## Step: AIC=1273.3
## medv ~ lstat + rm + ptratio + dis + nox + chas + zn + b + rad +
## crim
##
## Df Sum of Sq RSS AIC
## + tax 1 199.421 8744.1 1266.2
## <none> 8943.5 1273.3
## + indus 1 39.008 8904.5 1273.5
## + age 1 7.751 8935.8 1275.0
##
## Step: AIC=1266.19
## medv ~ lstat + rm + ptratio + dis + nox + chas + zn + b + rad +
## crim + tax
##
## Df Sum of Sq RSS AIC
## <none> 8744.1 1266.2
## + age 1 12.7502 8731.3 1267.6
## + indus 1 1.1286 8743.0 1268.1
On this Q-Q plot, we can see that the residuals are not normally distributed. This means that the model provided by the forward selection algorithm is not appropriate for our data…
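The model below is fitted on a training subset, train_data. Its 392 residual degrees of freedom (plus 12 coefficients) imply 404 training rows out of 506, consistent with an 80/20 split; a sketch of one such split (the seed and exact proportion are assumptions):

```r
set.seed(42)  # hypothetical seed; the original split may differ
n <- nrow(BostonHousing)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

train_data <- BostonHousing[train_idx, ]
test_data  <- BostonHousing[-train_idx, ]

# Refit the forward-selected model on the training rows only.
train_model <- lm(medv ~ lstat + rm + ptratio + dis + nox + chas +
                    zn + b + rad + crim + tax, data = train_data)
summary(train_model)
```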
##
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas +
## zn + b + rad + crim + tax, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.4118 -2.6786 -0.5805 1.7247 25.1597
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.828112 5.692801 6.996 1.14e-11 ***
## lstat -0.575216 0.052701 -10.915 < 2e-16 ***
## rm 3.481624 0.461829 7.539 3.33e-13 ***
## ptratio -0.975927 0.144148 -6.770 4.71e-11 ***
## dis -1.535931 0.206057 -7.454 5.86e-13 ***
## nox -17.406957 3.923249 -4.437 1.19e-05 ***
## chas 3.150152 0.913474 3.449 0.000625 ***
## zn 0.046922 0.015137 3.100 0.002075 **
## b 0.007637 0.003160 2.416 0.016128 *
## rad 0.302551 0.068610 4.410 1.34e-05 ***
## crim -0.103341 0.034359 -3.008 0.002802 **
## tax -0.010953 0.003663 -2.990 0.002966 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.723 on 392 degrees of freedom
## Multiple R-squared: 0.7436, Adjusted R-squared: 0.7364
## F-statistic: 103.4 on 11 and 392 DF, p-value: < 2.2e-16
As we can see, the model is still not very good. We can try to improve it by using a regression tree. Intuitively, the data seems to be segmented into three groups: one with the lowest medv values, one with the highest, and a third in between. Let’s use a tree to segment these groups properly.
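A sketch of a regression tree fit with rpart (the package choice is an assumption):

```r
library(rpart)

# Regression tree on all predictors; rpart picks the splits that most
# reduce the variance of medv within each node.
tree_model <- rpart(medv ~ ., data = BostonHousing)

# Plot the tree structure with split labels.
plot(tree_model)
text(tree_model, cex = 0.7)
```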
We can see that the two variables lstat and
rm drive the main splits of the tree, as we had deduced
intuitively before.